Introduction This procedure analyzes the RNA-seq data of multiple samples for quality control purpose, using the gene-level read counts. The read counts can be provided as one or multiple integer data matrixes, corresponding to different mapping types, such as unique vs. multiple mapping and sense vs. antisense strands. The following steps will be applied to the read count data:

  • Summary statistics: the read count data is summarized by sample, gene and mapping type.
  • Dispersion and normalization: different normalization methods are compared based on between-sample variance of all genes before and after normalization.
  • Sample analysis: X/Y genes are used to predict sample gender; and autosomal genes are used for unsupervised sample clustering.

 

Go to project home

1 Description

1.1 Project

Transcriptome of mouse brain tumors

1.2 Data

RNA-seq data was generated from of 2 types of mouse brain tumors. Some tumors are tested for drug response. Raw data was processed to get gene-level read counts.

1.3 Analysis

This is a demo.

2 Summary statistics

  • Number of samples: 19
  • Number of genes: 24,060
  • Average number of reads per gene: 4.e+04
  • Total number of reads per sample: 49,279,527

2.1 Read count by sample

The total read count per sample is the sum of sequence reads of one sample mapped to all genes. Inconsistency of total reads between samples might suggest data quality issues and affect downstream analysis. For example, low total read count could be caused by high level of RNA degradation or high rate of sequencing errors. Shapiro-Wilk normality test shows that the total read counts of this data set is normally distributed (p = 0.89). Samples with extreme total read counts comparing to the others:

  • Samples with extremly low read count: none
  • Samples with extremly high read count: none
plot of chunk read_count_sample_hist

Figure 1. Distribution of total read counts of all samples. Total read count per sample (millions):

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34.72   44.61   50.54   49.28   54.35   64.37

2.2 Read count by gene

Due to the variability of gene length and expression level, it is expected that a large portion of the sequence reads are mapped to a small number of genes. At the same time, a large portion of the genes will have no or few reads mapped to them, making them unsuitable for statistical analysis. The distribution of read counts across genes and the consistency of such distribution between samples also provide information about RNA-seq data quality. In this data set,

  • 16.8% genes have no reads mapped to them in any sample.
  • 26.66% genes have less 1 read mapped to them per sample.
  • 32.32% genes have less 5 reads mapped to them per sample .
plot of chunk read_count_gene_sorted
Figure 2. The total read count of all samples mapped to each gene is used to sort genes from high to low. The cumulative percent of top genes is plotted along gene ranking.

 

plot of chunk read_count_gene_top

Figure 3. The cumulative read counts of top 10 genes and the percent of total reads by these genes are calculated for each sample. Samples are ordered by the percents from low to high on the x-axis in this figure. The 2 samples with the lowest and the highest cumulative percents are labeled. The cumulative percents of all samples are then compared to each other to identify samples with extremely lower or higher percents than the other samples:

  • Samples with extremely low percent of reads from top 10 genes: none
  • Samples with extremely high percent of reads from top 10 genes: T1_5

2.3 Read count by mapping type

The read-to-gene mapping could be complicated by at least 3 conditions corresponding to 8 mapping types:

  • Whether the read is mapped to one and only one gene or multiple genes: unique vs. multiple
  • Whether both ends of paired end reads are mapped to the same gene or only one end is mapped: paired vs. unpaired
  • Whether the read is mapped to the sense strand or antisense strand of a gene: sense vs. antisense

This analysis accepts multiple matching matrixes of read counts corresponding to different mapping types. By default, the first matrix corresponds to the most common mapping type and will be used for all the other analyses in this report. In this section, however, read counts of different mapping types are summarized and compared to each other when multiple matrixes are provided.

plot of chunk read_count_type_pie

Figure 4. Total mapped reads of all samples are split by mapping types, such as unique vs. multiple, paired vs. unpaired, and sense vs. antisense. Among the 2 mapping type(s),

  • The most common mapping type is Sense (97.5497% of total mapped reads)
  • The least common mapping type is Antisense, (2.4503% of total mapped reads)

 

When the gene-level read counts of two mapping types are strongly correlated to each other, they can be combined to increase total read counts and hence statistical power of data analysis. Negative or lack of correlation between mapping types might also provide useful information.

plot of chunk read_count_type_corr
Figure 5. The mapping type has the strongest correlation to the first and most common mapping type is identified based on Spearman’s correlation coefficients. The gene-level read counts of both mapping types are plotted in this figure. Click here to view correlation coefficients between all pairs of mapping types.

 

By comparing read counts of different mapping types to each other, the ratios can be compared between samples for consistency. A sample might have quality issue if it has reads of a mapping type much less or more than the other samples.

plot of chunk read_count_type_heatmap
Figure 6. The relative frequecy of mapping types are calculated for each sample, using the first read count matrix as reference. The frequency of each mapping type are normalized across all samples (Mean = 0 and SD = 1.0). This figure shows the normalized frequency of each mapping type in each sample (red = high). - Mapping type Antisense has the most “abnormally” high relative frequency (# of standard deviations = 1.59) in sample T2_4, T2_6. - Mapping type Antisense has the most “abnormally” low relative frequency (# of standard deviations = -1.57) in sample T2_4.

3 Dispersion and normalization

Gene-specific dispersion of RNA-seq read counts is commonly used to evaluate between-sample variance of a data set. Here, the dispersion is estimated by the overall pattern of coefficient of variation (standard deviation devided by average read count) of all genes.

plot of chunk dispersion
Figure 7. Left: the distribution of average gene expression level (read count divided by gene length). Right: the dispersion of gene expression level between samples, measured as the coefficient of variation. Usually, genes with lower expression will have larger dispersion on average.

There are many ways to normalize RNA-seq data to remove systematic bias between samples, usually based on the assumption that all the samples have the same global distribution. The following analysis performs several different normalization methods and evaluate their impact on data dispersion:

  • Original: The un-normalized gene-level read counts.
  • Total_Count: Rescale data so all the samples have the same total read count.
  • Median: Rescale data so all the samples have the same median read count.
  • Quantile_Quantile: Make all the samples have exactly the same distribution.
  • Upper_Quantile: Rescale data so all the samples have the same upper quantile read count.
  • Trimmed_Mean: The “weighted trimmed mean of M-values” method implemented by the edgeR package.
  • Relative_Log: The “relative log expression” method implemented by the edgeR package.
  • DESeq: The normalization method of the DESeq package.
  • FPKM: Fragments Per Kilobase of transcript per Million mapped reads.
  • TPM: Transcript per Million mapped reads.
  • Loess: Make all the samples have the similar distribution by fitting them to same Loess distribution.
  • Cyclic_Loess: The “cyclic loess” method implemented by the limma package.

Check this paper for detailed comparison of normalization methods, and this function for R code of these normalization methods.

plot of chunk normalization_boxplot
Figure 8. Each panel corresponds to a normalization method, including the boxplots of all samples. Samples are colored by their Group.
plot of chunk average_cv
Figure 9. Each boxplot corresponds to a normalization method and the overall distribution of change in coefficient of variation of all genes. The means of all genes are labeled as the read stars. Click here to view the table of coefficients of variation of all genes.

 

Table 1 Summary of coefficient of variation (CV) from each normalization method. The columns are 1) average CV of all genes; 2) correlation of CV between the original read counts and normalized read counts of all genes; 3) the number of gene with decreased CV; 4) the number of genes with increased CV; and 5) the ratio of decreased to increased genes.
Method Mean Change(%) Corr2Original Num_Decrease Num_Increase Decrease/Increase
Original 0.94 0.0000 1.00 0 0 NaN
Total_Count 0.92 -2.2981 1.00 12800 6549 1.95
Median 0.92 -1.6242 1.00 11365 8222 1.38
Quantile_Quantile 0.89 -5.1722 0.90 13254 5918 2.24
Upper_Quantile 0.91 -2.9796 1.00 14253 5266 2.71
Trimmed_Mean 0.94 0.1900 1.00 8850 10608 0.83
Relative_Log 0.94 0.1900 1.00 8850 10608 0.83
DESeq 0.92 -2.3596 1.00 13071 6353 2.06
FPKM 0.92 -2.2981 1.00 12884 6568 1.96
TPM 0.94 -0.3191 1.00 10637 8799 1.21
Loess 0.87 -7.0426 0.99 14856 5163 2.88
Cyclic_Loess 0.90 -3.7413 0.99 13802 6217 2.22
Go to project home

4 Sample analysis

This section uses read count data to perform several sample-level analyses. The results can be used to identify potential mislabeling, outlier, confounding variable, etc. Read counts normalized by the Loess method are used for all analyses in this section.

4.1 Gender prediction

When enough genes on X and Y chromosomes have detectable expression, these genes can be used to predict the gender of samples. In case the gender information is already available, the predicted gender can be used to identify potential mislabeling.

plot of chunk gender_pred_plot
Figure 10. The average of Log2(Read Count+1) is calculated for X and Y chromosomes when there are at least 5 genes having detectable expression. The averages of each sample are plotted in this figure as X and Y axes. The averages of each chromosome are used separately to cluster samples using the pamk {fpc} function. If the optimal number of clusters is 2 for both chromosomes, gender prediction will be made, and the samples having predicted gender agreed byt the two chromosomes will be labeled with red (female) or blue (male). Click here to veiw the full table of gender prediction.

4.2 Hierarchical clustering

Hierarchical clustering groups samples based on their glabal correlation of all genes. It is a iterative procedure that merges two most similar samples/groups at each step.

plot of chunk hierarchical_clustering
Figure 11. Hierarchical clustering of samples using all autosomal genes. Normalized read counts are log2-transformed before clustering. The vertical position of the common node of two samples indicates their similarity (lower = more similar).

4.3 Principal components analysis

Principal components analysis

4.3.1 Group

4.3.2 Type

4.3.3 Drug

Figure 12. PCA is performed using all autosomal genes after read counts are log2-transformed. The same plot is made in multiple tabs while samples are color-coded based on known sample features.

5 Appendix

Check out the RoCA home page for more information.

5.1 Reproduce this report

To reproduce this report:

  1. Find the data analysis template you want to use and an example of its pairing YAML file here and download the YAML example to your working directory

  2. To generate a new report using your own input data and parameter, edit the following items in the YAML file:

  • output : where you want to put the output files
  • home : the URL if you have a home page for your project
  • analyst : your name
  • description : background information about your project, analysis, etc.
  • input : where are your input data, read instruction for preparing them
  • parameter : parameters for this analysis; read instruction about how to prepare input data
  1. Run the code below within R Console or RStudio, preferablly with a new R session:
if (!require(devtools)) { install.packages('devtools'); require(devtools); }
if (!require(RCurl)) { install.packages('RCurl'); require(RCurl); }
if (!require(RoCA)) { install_github('zhezhangsh/RoCAR'); require(RoCA); }

CreateReport(filename.yaml);  # filename.yaml is the YAML file you just downloaded and edited for your analysis

If there is no complaint, go to the output folder and open the index.html file to view report.

5.2 Session information

## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] Rnaseq_0.0.0.9000       DEGandMore_0.0.0.9000  
##  [3] samr_2.0                matrixStats_0.14.2     
##  [5] impute_1.42.0           DESeq_1.20.0           
##  [7] lattice_0.20-33         locfit_1.5-9.1         
##  [9] Biobase_2.28.0          gplots_3.0.1           
## [11] fpc_2.1-10              vioplot_0.2            
## [13] sm_2.2-5.4              htmlwidgets_0.5        
## [15] DT_0.1                  yaml_2.1.13            
## [17] rmarkdown_0.9.6         knitr_1.13             
## [19] edgeR_3.10.2            awsomics_0.0.0.9000    
## [21] RoCA_0.0.0.9000         RCurl_1.95-4.8         
## [23] bitops_1.0-6            RankProd_2.42.0        
## [25] devtools_1.11.1         DESeq2_1.8.1           
## [27] RcppArmadillo_0.5.300.4 Rcpp_0.12.4            
## [29] GenomicRanges_1.22.4    GenomeInfoDb_1.6.3     
## [31] IRanges_2.4.8           S4Vectors_0.8.11       
## [33] BiocGenerics_0.16.1     limma_3.26.9           
## [35] snow_0.4-1             
## 
## loaded via a namespace (and not attached):
##  [1] RColorBrewer_1.1-2   httr_1.2.0           prabclus_2.2-6      
##  [4] tools_3.2.2          R6_2.1.2             KernSmooth_2.23-15  
##  [7] rpart_4.1-10         Hmisc_3.16-0         DBI_0.3.1           
## [10] colorspace_1.2-6     trimcluster_0.1-2    nnet_7.3-10         
## [13] withr_1.0.1          gridExtra_2.0.0      curl_0.9.7          
## [16] git2r_0.15.0         formatR_1.3          caTools_1.17.1      
## [19] diptest_0.75-7       scales_0.2.5         DEoptimR_1.0-3      
## [22] mvtnorm_1.0-3        robustbase_0.92-5    genefilter_1.50.0   
## [25] stringr_1.0.0        digest_0.6.9         foreign_0.8-66      
## [28] XVector_0.10.0       htmltools_0.3.5      highr_0.5.1         
## [31] RSQLite_1.0.0        jsonlite_0.9.22      mclust_5.0.2        
## [34] BiocParallel_1.2.20  gtools_3.5.0         acepack_1.3-3.3     
## [37] magrittr_1.5         modeltools_0.2-21    Formula_1.2-1       
## [40] futile.logger_1.4.1  munsell_0.4.2        proto_0.3-10        
## [43] stringi_1.0-1        MASS_7.3-43          zlibbioc_1.14.0     
## [46] flexmix_2.3-13       plyr_1.8.3           grid_3.2.2          
## [49] gdata_2.17.0         splines_3.2.2        annotate_1.46.1     
## [52] geneplotter_1.46.0   reshape2_1.4.1       futile.options_1.0.0
## [55] XML_3.98-1.3         evaluate_0.9         latticeExtra_0.6-26 
## [58] lambda.r_1.1.7       gtable_0.1.2         kernlab_0.9-22      
## [61] ggplot2_1.0.1        xtable_1.7-4         e1071_1.6-7         
## [64] class_7.3-13         survival_2.38-3      AnnotationDbi_1.30.1
## [67] memoise_1.0.0        cluster_2.0.3

END OF DOCUMENT